Take-home Exercise 2

Population pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.

Sun Shengmei https://www.linkedin.com/in/shengmei-sun-9b262656/?originalSubdomain=sg (SMU MITB)https://scis.smu.edu.sg/master-it-business
2022-02-03

Task: The Changes of Singapore Population Pyramid between 2000-2020

Overview

Population Pyramid or “age-sex pyramid” is a graphical illustration of the distribution of a population by age groups and sex. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population. The pyramid can be used to visualize the age of a particular population. It is also used in ecology to determine the overall age distribution of a population; an indication of the reproductive capabilities and likelihood of the continuation of a species (Wikipedia).

The Task

In this take-home exercise, I am going to design an age-sex pyramid based data visualization to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.

For this task, the data sets entitle Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 are used. These data sets are available at Department of Statistics home page. The data set is available at Department of Statistics home page.

Getting Started

Installing and loading the required libraries

The code chunk below is used to check if the necessary R packages are installed in R. If they have yet, then RStudio will install the missing R package(s). If are already been installed, then they will be launch in R environment.

packages = c('ggiraph', 'plotly', 
             'DT', 'patchwork',
             'gganimate', 'tidyverse',
             'readxl', 'gifski', 'gapminder')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Data Import

The code chunk below imports respopagesextod2000to2010.csv and respopagesextod2011to2020.csv into R environment by using read_csv() function of readr package.

After parsing the worksheet into R, it is a good practice to check the structure and content of the newly tibble data frames in RStudio.

df1 <- read_csv("data/respopagesextod2000to2010.csv") 
df2 <- read_csv("data/respopagesextod2011to2020.csv") 
tbl_df(df1)
# A tibble: 1,040,592 x 7
   PA         SZ        AG     Sex     TOD                   Pop  Time
   <chr>      <chr>     <chr>  <chr>   <chr>               <dbl> <dbl>
 1 Ang Mo Kio Cheng San 0_to_4 Males   HDB 1- and 2-Room ~    20  2000
 2 Ang Mo Kio Cheng San 0_to_4 Males   HDB 3-Room Flats      480  2000
 3 Ang Mo Kio Cheng San 0_to_4 Males   HDB 4-Room Flats      220  2000
 4 Ang Mo Kio Cheng San 0_to_4 Males   HDB 5-Room and Exe~    80  2000
 5 Ang Mo Kio Cheng San 0_to_4 Males   HUDC Flats (exclud~     0  2000
 6 Ang Mo Kio Cheng San 0_to_4 Males   Landed Properties       0  2000
 7 Ang Mo Kio Cheng San 0_to_4 Males   Condominiums and O~     0  2000
 8 Ang Mo Kio Cheng San 0_to_4 Males   Others                  0  2000
 9 Ang Mo Kio Cheng San 0_to_4 Females HDB 1- and 2-Room ~    20  2000
10 Ang Mo Kio Cheng San 0_to_4 Females HDB 3-Room Flats      390  2000
# ... with 1,040,582 more rows
tbl_df(df2)
# A tibble: 984,656 x 7
   PA         SZ                     AG     Sex     TOD      Pop  Time
   <chr>      <chr>                  <chr>  <chr>   <chr>  <dbl> <dbl>
 1 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   HDB 1~     0  2011
 2 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   HDB 3~    10  2011
 3 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   HDB 4~    30  2011
 4 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   HDB 5~    50  2011
 5 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   HUDC ~     0  2011
 6 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   Lande~     0  2011
 7 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   Condo~    40  2011
 8 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Males   Others     0  2011
 9 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 1~     0  2011
10 Ang Mo Kio Ang Mo Kio Town Centre 0_to_4 Females HDB 3~    10  2011
# ... with 984,646 more rows

Data Wrangling

Combining the two data frames

As we can see from above, the two data frames have the same column names. We will firstly combine the two data frames into one by rows using rbind.

dfcombined <- rbind(df1, df2)
summary(dfcombined)
      PA                 SZ                 AG           
 Length:2025248     Length:2025248     Length:2025248    
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
     Sex                TOD                 Pop         
 Length:2025248     Length:2025248     Min.   :   0.00  
 Class :character   Class :character   1st Qu.:   0.00  
 Mode  :character   Mode  :character   Median :   0.00  
                                       Mean   :  38.41  
                                       3rd Qu.:  10.00  
                                       Max.   :3160.00  
      Time     
 Min.   :2000  
 1st Qu.:2005  
 Median :2010  
 Mean   :2010  
 3rd Qu.:2015  
 Max.   :2020  

Change the age group name for sorting

We need to sort the values alphanumerically by the age group. The following code chunk changes age group “5_to_9” to “05_to_09” to allow for ggplot default labels to arrange the data according to age group. If not, “5_to_9” would be placed among the 50s group data.

dfcombined <- dfcombined %>%
  mutate(AG = case_when(AG == "0_to_4" ~ "00_to_04",
                        AG == "5_to_9" ~ "05_to_09",
                        TRUE ~ AG) )

Change the sign of number of males to negative

dfcombined <- dfcombined %>%
  mutate(Pop1 = case_when(Sex == "Males" ~ -Pop, TRUE ~ Pop))

Select the relevant variables

dfcombined_selected <- dfcombined %>%
  select(- SZ, - TOD)

Group the population data

dfgrouped_selected <- dfcombined_selected %>%
  group_by(PA, Time, AG, Sex) %>%
  summarise(Pop1=sum(Pop1), .groups = 'keep') %>%
  ungroup()

Chart Plotting

Demographic Structure of Singapore 2000-2020 - gganimate

A static plot was firstly created using ggplot2 bar chart. There are two types of bar charts: geom_bar and geom_col. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is. We can also use geom_bar() and override the default connection between geom_bar() and stat_count() by changing “stat” argument to “identity”.

Next, transition_time() method of gganimate extension is added on top of the static plot to determine the behavior of the animation. The transition should happen across time. Linear easing is chosen to define the pace of change.

The full code chunk and output are shown below.

ggplot(data=dfgrouped_selected,
       aes(x = AG, y = Pop1, fill = Sex)) +
  geom_col() +
  xlab("Age") +
  scale_y_continuous(name = "Population",
                     breaks = c(seq(-160000, 160000, 20000)),
                     labels = paste0(as.character(c(seq(160, 0, -20), seq(20, 160, 20))), "K")) +
  coord_flip() +
  theme_bw() +
  labs(title = 'Demographic Structure of Singapore 2000-2020\n\n {as.integer(frame_time)}',
       caption = '\n\n Data Source: https://www.singstat.gov.sg/') +
  theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
  transition_time(Time) +
  ease_aes('linear')

From above animation, we can see that Singapore population have been aging from 2000 to 2020 for both males and females as the Pyramid center moves higher with time passing by. There are more female elderly than male elderly.

Demographic Structure of Singapore by Planning Area 2000-2020 - ggplotly

In this section, we are going to zoom into more details to see how the demographic structure varies by planning area.

p <- dfgrouped_selected %>%
  ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
  geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1), 
                                                   "\nSex: ", Sex, "\nYear: ", Time))) +
  xlab("Age") +
  coord_flip() +
  scale_y_continuous(name = "Population",
                     breaks = c(seq(-16000, 16000, 2000)),
                     labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
  facet_wrap(~PA, ncol=2,scale = "free_x") +
  theme_bw() +
  theme(legend.position = "none") +

  ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
  theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))

ggplotly(p, height = 10000, width = 800, tooltip = "text") 

Given the size of number of planning areas, let’s focus our analysis on the top5 populated areas of 2000 and 2020.

data_2000 <- dfcombined_selected %>%
  filter(Time == "2000")

data_2000_grouped <- data_2000 %>%
  group_by(PA) %>%
  summarise(`TotalPop` = sum(Pop)) %>%
  ungroup()

Top5_2000 <- data_2000_grouped %>%
  arrange(desc(TotalPop))

head(Top5_2000)
# A tibble: 6 x 2
  PA          TotalPop
  <chr>          <dbl>
1 Bedok         285440
2 Tampines      253320
3 Jurong West   204900
4 Hougang       204520
5 Woodlands     188870
6 Ang Mo Kio    181100

Top 5 populated areas of 2000 were “Bedok”, “Tampines”, “Jurong West”, “Hougang”, and “Woodlands”.

data_2020 <- dfcombined_selected %>%
  filter(Time == "2020")

data_2020_grouped <- data_2020 %>%
  group_by(PA) %>%
  summarise(`TotalPop` = sum(Pop)) %>%
  ungroup()

Top5_2020 <- data_2020_grouped %>%
  arrange(desc(TotalPop))

head(Top5_2020)
# A tibble: 6 x 2
  PA          TotalPop
  <chr>          <dbl>
1 Bedok         277720
2 Jurong West   263050
3 Tampines      260380
4 Woodlands     255350
5 Sengkang      249670
6 Hougang       228130

Top 5 populated areas of 2020 were “Bedok”, “Jurong West”, “Tampines”, “Woodlands” and “Sengkang”.

dfgrouped_selected1 <- dfgrouped_selected %>%
  filter(PA ==  "Bedok" | PA == "Jurong West" | PA == "Tampines" |
           PA == "Woodlands" | PA =="Sengkang" | PA == "Hougang")

p1 <- dfgrouped_selected1 %>%
  ggplot(aes(x = AG, y = Pop1, fill = Sex, frame = Time)) +
  geom_col(position = "identity", aes(text = paste("Age: ", AG, "\nPop: ", abs(Pop1), 
                                                   "\nSex: ", Sex, "\nYear: ", Time))) +
  xlab("Age") +
  coord_flip() +
  scale_y_continuous(name = "Population",
                     breaks = c(seq(-16000, 16000, 2000)),
                     labels = paste0(as.character(c(seq(16, 0, -2), seq(2, 16, 2))), "K")) +
  facet_wrap(~PA, ncol=2,scale = "free_x") +
  theme_bw() +
  theme(legend.position = "none") +

  ggtitle("Singapore Population Pyramid by Planning Area 2000-2020\n\n") +
  theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5, vjust = -4))

ggplotly(p1, height = 1500, width = 800, tooltip = "text") 

For the plot above, we have below observations:

Young and Aged Adults Share by Planning Area - gganimate

There are various method to define young and aged adult. For the purpose of this study, we define young adults to be between 20 and 39, based on Erikson’s research and define aged adults to be older than 65 (Wikipedia).

dfcombined_selected1 <- dfcombined_selected %>%
  select(- Pop1)

df3<- dfcombined_selected1 %>%
  mutate(Group = case_when(AG == "20_to_24" |AG == "25_to_29"|AG == "30_to_34"|AG == "35_to_39" ~ "Young",
                           AG == "65_to_69" |AG == "70_to_74"|AG == "75_to_79"|AG == "80_to_84"|
                             AG == "85_to_89"|AG == "90_and_over" ~ "Old",
                           TRUE ~ "Others")) %>%
  select(- AG, -Sex)%>%
  group_by(Time, PA, Group) %>%
  summarise(GroupPop = sum(Pop))%>%
  ungroup() %>%
  pivot_wider(names_from = Group, values_from = GroupPop) %>%
  mutate (Total = Old+Others+Young) %>%
  mutate (Pct_y = Young/Total*100) %>%
  mutate (Pct_o = Old/Total*100)

df3
# A tibble: 1,155 x 8
    Time PA                       Old Others Young  Total Pct_y  Pct_o
   <dbl> <chr>                  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
 1  2000 Ang Mo Kio             14430 107010 59660 181100  32.9   7.97
 2  2000 Bedok                  22960 173840 88640 285440  31.1   8.04
 3  2000 Bishan                  6170  56150 27960  90280  31.0   6.83
 4  2000 Boon Lay/Pioneer           0      0     0      0 NaN   NaN   
 5  2000 Bukit Batok             6140  77400 43140 126680  34.1   4.85
 6  2000 Bukit Merah            18720  82820 47330 148870  31.8  12.6 
 7  2000 Bukit Panjang           4850  58290 33620  96760  34.7   5.01
 8  2000 Bukit Timah             5250  40640 18700  64590  29.0   8.13
 9  2000 Central Water Catchme~     0      0     0      0 NaN   NaN   
10  2000 Changi                    30    640   380   1050  36.2   2.86
# ... with 1,145 more rows

Next, we used the prepared data frame df3 to plot the animated bubble plot to show the evolution of % Young versus % Aged population.

ggplot(df3, aes(x = Pct_o, y = Pct_y, 
                      size = Total, 
                      colour = PA)) +
  geom_point(alpha = 0.7, 
             show.legend = FALSE) +
  scale_size(range = c(2, 15)) +
  theme_bw() +
  labs(title = 'Singapore Young and Aged Adults Share\n\n {as.integer(frame_time)}',
       caption = '\n\n Data Source: https://www.singstat.gov.sg/',
       x = '% Aged', 
       y = '% Young') +
  theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
  transition_time(Time) +
  ease_aes('linear')

As we can see from above chart, % Aged increased and % Young decreased from 2000 to 2020. The decrease of % Young from 2010 onwards is more obvious, compared to that from 2000 to 2009.

Conclusion

Singapore is an aging population for both males and females. There are more female elderly than male elderly. Mature residential areas are representative of overall Singapore demographics changes over time. Recently developed planning areas, for example Punggol and Sengkang, have more young and middle-aged population than the matured planning areas.